INTRODUCTION

North Carolina’s juvenile arrest records occupy a complicated position between public availability and legal protection. Since the “Raise the Age” reform took effect in 2019, most 16- and 17-year-olds are handled in juvenile rather than adult court, which makes their personal details, including age and demographics, confidential. This shift shows up in Chapel Hill’s recent arrest statistics in subtle but consequential ways: age fields are commonly redacted, which obscures who is being arrested and where. At the same time, arrest patterns by time of day and season suggest broader cyclical trends in criminal activity across neighborhoods. Together, these trends raise questions about how time, place, and confidentiality intersect in Chapel Hill, a small municipality where public safety, university operations, and juvenile protection converge.

Prompted by this complexity, our group focused on a dataset of 37,310 arrests made by the Chapel Hill Police Department from 2010 to 2024. From this, we developed two central questions. First, can we forecast the hour of the day at which arrests are most likely within particular Chapel Hill zip codes, given season, semester status, arrest type, and demographic factors? This question matters because understanding the temporal and spatial distribution of arrests could help the police department deploy resources more effectively, for instance by positioning officers near nightlife districts during late weekend evenings or around university campuses during stressful periods such as examination weeks. It would also give residents a clearer picture of community patterns and vulnerable periods, helping both residents and students go about their daily routines more safely.

Secondly, are arrests with redacted age data spatially clustered, and can environmental or institutional factors, such as proximity to educational institutions or central business district patrol zones, explain these patterns? Given the rise in redactions due to evolving juvenile privacy statutes, spatially mapping the locales of these obscured-age arrests could yield significant understanding of the spatial dynamics of youth policing. Understanding these redaction patterns can assist data owners and policymakers in promoting transparent reporting and informing reforms intended to balance public accountability with youth protection.

Collectively, these inquiries turn arrest records from inert data into active narratives that can inform public policy, support smarter policing, and inspire deeper conversations about justice, privacy, and place within Chapel Hill.

DATA

The data used in this project originates from the Chapel Hill Police Department’s arrest logs, which are made publicly available through the data.gov website. While the data was retrieved from this open-data platform, it is originally collected and maintained by the Chapel Hill Police Department. Each observation in the dataset represents an individual arrest event, with information about when and where it occurred, details of the arrest, and key demographic characteristics of the arrested individual. This dataset is not a random sample but rather a semi-comprehensive record of arrest incidents in Chapel Hill from 2010 to 2024, with the exception of several months in 2021 (we have not received a response from the CHPD database manager explaining this gap). After cleaning and filtering, our working dataset contains 37,310 observations, each corresponding to a single arrest. The following table shows a sample of rows with the most important variables in our data:

| Arrest_Date | Street | Arrest_Type | Drugs_Alcohol | Age | Gender | Race | Disposition | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 2016-04-23 13:32:00 | 137 E FRANKLIN ST | TAKEN INTO CUSTODY (WARRANT/LP) | N | 47 | M | NA | CLEARED BY ARREST | 35.91 | -79.05 |
| 2023-02-19 23:45:00 | 104 BILLIE HOLIDAY CT | ON VIEW | N | 38 | M | B | CLEARED BY ARREST | 35.95 | -79.05 |
| 2017-05-04 10:53:00 | 2701 HOMESTEAD RD | TAKEN INTO CUSTODY (WARRANT/LP) | N | 23 | F | NA | CLEARED BY ARREST | 35.95 | -79.06 |
| 2013-08-25 01:22:00 | PRITCHARD AVE. @ CARR ST. | SUMMONED/CITED | NA | 20 | M | NA | CLEARED BY ARREST | 35.92 | -79.06 |
| 2014-10-25 07:00:00 | 125 E ROSEMARY ST | SUMMONED/CITED | N | 44 | M | B | CLEARED BY ARREST | 35.91 | -79.06 |

After doing an exploratory data analysis, we found two trends to investigate further. Firstly, we observed a wide and bimodal distribution of arrest times throughout the day, with distinct peaks around midnight that showed patterned behavior. This pattern varied by day of the week, by location, and by the nature of the arrest itself. These patterns motivated us to build models that predict the “Hour of Arrest” based on contextual variables.

The second trend we noticed was a sizable number of arrest records missing age data. Many of the arrests with missing age data also had missing demographics, such as race, gender, and ethnicity. These arrests were not evenly distributed across Chapel Hill but instead clustered in specific geographic areas. In particular, the police headquarters and East Chapel Hill High School had over 110 arrests each. From this trend we hypothesized that the arrests with unknown ages were those of minors, and that the identifying information was redacted to protect them. The figure below shows the geographic distribution of arrests with unknown age; larger circles represent more arrests at a location.

To support our analysis, we engineered several notable variables from the original data. From the “Arrest Date”, we extracted the “Hour of Arrest”, “Day of the Week”, “Month”, “Season”, and “Academic Semester” (Spring, Summer, Fall, or Break), based on the University of North Carolina at Chapel Hill’s (UNC-CH) academic calendar. A binary indicator (“Franklin”) was created to identify whether the arrest occurred on Franklin Street, a busy street in downtown Chapel Hill that exhibits measurably higher arrest activity. We also included variables for “Zip Code”, “Latitude”, “Longitude”, and demographics such as “Age”, “Gender”, and “Race”. Underage status was determined by identifying rows where “Age” was missing (we assume these records correspond to individuals under 18 whose age was withheld). “Disposition” represents the outcome of an arrest and was used in our analysis of Question 2. Variables unrelated to our explorations (such as “Arrest ID”) were excluded.
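The feature engineering described above could look roughly like the following Python sketch. It is illustrative only and not the code used in the project: the function names are hypothetical, the season definition is an assumed meteorological one, the semester cut-off dates are placeholders rather than the actual UNC-CH calendar, and the Franklin Street check is a simple substring match on the street field.

```python
from datetime import datetime

def season_of(month):
    """Assumed meteorological seasons (Dec-Feb = Winter, and so on)."""
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Fall"

def semester_of(ts):
    """Rough academic term. The cut-off dates are placeholders, not the real UNC-CH calendar."""
    month_day = (ts.month, ts.day)
    if (1, 10) <= month_day <= (5, 10):
        return "Spring"
    if (5, 15) <= month_day <= (7, 31):
        return "Summer"
    if (8, 18) <= month_day <= (12, 15):
        return "Fall"
    return "Break"

def engineer_features(arrest_date, street):
    """Derive time and location features from one raw arrest record."""
    ts = datetime.strptime(arrest_date, "%Y-%m-%d %H:%M:%S")
    return {
        "Hour": ts.hour,
        "DayOfWeek": ts.weekday(),        # 0 = Monday ... 6 = Sunday
        "Month": ts.month,
        "Season": season_of(ts.month),
        "Semester": semester_of(ts),
        "IsWeekend": ts.weekday() >= 5,   # Saturday or Sunday
        "Franklin": "FRANKLIN ST" in street.upper(),  # assumed matching rule
    }

# First sample row from the data table
row = engineer_features("2016-04-23 13:32:00", "137 E FRANKLIN ST")
print(row)
```

Keeping each derived variable in its own small function makes it easy to swap in the real academic calendar dates once they are known.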

RESULTS

Question 1: Can We Predict the Hour of Day When Arrests Are Most Likely to Occur?

To answer this question, we developed and compared five different models to predict the hour of arrest. These included:

  1. Linear Regression: Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + IsWeekend
  2. K-Nearest Neighbors (KNN): Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester
  3. Random Forest - Simple Version: Hour ~ Month + Day + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + Franklin
  4. Random Forest - Base Version: Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + Latitude + Longitude
  5. Random Forest - Full Version (with added features): Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + Latitude + Longitude + Year + Age + Franklin + Gender + Race

We explored three modeling approaches to predict the hour of arrest: linear regression, K-Nearest Neighbors (KNN), and Random Forest. Linear regression served as a baseline for model comparison. Because the relationships among our variables are non-linear, we did not expect linear regression to predict the hour accurately. Second, we implemented a KNN model, which predicts the arrest hour as the average of the most similar observations. KNN models handle non-linearity well, but they can struggle with imbalanced data. Finally, we applied a Random Forest model, a machine learning method that builds many individual decision trees and combines their results to make more accurate and stable predictions. Each tree in the forest is trained on a random sample of the data, which makes the combined prediction less sensitive to any single subset. We used 100 trees in our models to reduce their sensitivity to noise introduced by the number of variables being analyzed. We chose this model for our data in particular because it can handle a large number of variables of different types, so adding informative variables tends to improve its predictions, which is not always true for other models. The variables in each model were selected based on their observed relevance in our exploratory analysis. The graphs below show each model’s predictions compared to the actual hour.
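To make the KNN idea concrete, here is a minimal sketch of nearest-neighbor regression in plain Python. It is not the model fitted in this project: the function name `knn_predict` is hypothetical, the toy coordinates and hours are invented for illustration, and only numeric features with Euclidean distance are handled.

```python
import math

def knn_predict(train_features, train_hours, query, k=3):
    """Predict an hour as the mean of the k training rows closest to the query."""
    ranked = sorted(
        zip(train_features, train_hours),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_hours = [hour for _, hour in ranked[:k]]
    return sum(nearest_hours) / len(nearest_hours)

# Invented toy data: two numeric features per arrest and the observed hour
toy_features = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
toy_hours = [22, 23, 21, 10, 12]

prediction = knn_predict(toy_features, toy_hours, query=(0.2, 0.2), k=3)
print(prediction)
```

Because the prediction is a plain average of neighboring hours, neighbors at 11 PM and 1 AM would average to noon, which is one way the linear treatment of time can hurt this kind of model.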

All models were evaluated using Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) to assess their predictive accuracy.
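As a reference for how these two metrics are computed, the sketch below implements MAE and RMSE in plain Python. The function names and the four example hours are made up for illustration and are not drawn from the dataset.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: the average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error: like MAE, but large errors weigh more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Invented example: actual vs. predicted arrest hours
actual_hours = [23, 1, 13, 10]
predicted_hours = [22, 3, 13, 14]

print(mae(actual_hours, predicted_hours))
print(rmse(actual_hours, predicted_hours))
```

RMSE is always at least as large as MAE, and the gap widens when a few predictions are far off, so reading the two together gives a sense of how much of the error comes from outliers.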

| Model | MAE | RMSE |
|---|---|---|
| Random Forest - Full | 2.737512 | 3.560415 |
| Random Forest - Base | 3.401371 | 4.280269 |
| KNN | 5.799254 | 6.940523 |
| Random Forest - Simple | 6.047140 | 7.001957 |
| Linear | 6.779104 | 7.705276 |

The best-performing model was the Random Forest - Full, which suggests that the Random Forest benefits from the additional predictors. Every model tends to over-predict after midnight and under-predict before midnight. A glance at the residuals (actual - predicted hour) confirms this:

While predictions were relatively accurate around midday, model performance declined significantly during the late-night hours. This stems from treating Hour as a linear variable, which represents hour 0 (12 AM) and hour 23 (11 PM) as 23 hours apart even though they are only one hour apart in reality. Time of day is inherently circular, not linear. To address this, we transformed the Hour variable using sine and cosine functions, which places each hour on the unit circle and preserves its cyclical structure.

\[ \sin\left(\frac{2\pi \cdot \text{Hour}}{24}\right), \quad \cos\left(\frac{2\pi \cdot \text{Hour}}{24}\right) \]
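The effect of this encoding can be checked with a short Python sketch. The helper names `encode_hour` and `encoded_distance` are hypothetical and only illustrate the transformation above.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def encoded_distance(hour_a, hour_b):
    """Straight-line distance between two hours after encoding."""
    sin_a, cos_a = encode_hour(hour_a)
    sin_b, cos_b = encode_hour(hour_b)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# On the linear scale, 12 AM and 11 PM look 23 hours apart
print(abs(0 - 23))

# After encoding, 11 PM is as close to 12 AM as 1 AM is,
# and noon is the farthest point from midnight
print(encoded_distance(0, 23))
print(encoded_distance(0, 1))
print(encoded_distance(0, 12))
```

Both coordinates are needed: sine alone cannot distinguish 3 AM from 9 AM, and cosine alone cannot distinguish 6 AM from 6 PM.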

We then trained two new Random Forest models using these transformed values: one to predict the sine of the hour, and another to predict the cosine of the hour. We used the same variables as Random Forest - Full to give our machine learning the most data to make the best predictions.

  1. Sine Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + Latitude + Longitude + Year + Age + Franklin + Gender + Race
  2. Cosine Hour ~ Zip + Month + Day + Season + Arrest_Type + Drugs_Alcohol + Semester + DayOfWeek + Latitude + Longitude + Year + Age + Franklin + Gender + Race

Once both models generated predictions for sine and cosine values, we reconstructed the predicted hour using the two-argument arctangent function (atan2), mapping it back to the appropriate angle on the unit circle.

\[ \text{Hour} = \left( \frac{\text{atan2}(\text{Sine Hour}, \text{Cosine Hour}) \cdot 24}{2\pi} \right) \bmod 24 \]
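The reconstruction step can be sketched in Python with `math.atan2`, the two-argument arctangent, which takes the sine component first and the cosine component second. The function names here are hypothetical, and the round trip below uses exact sine and cosine values rather than model predictions.

```python
import math

def encode_hour(hour):
    """Forward transform: hour -> (sin, cos) on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def decode_hour(sin_value, cos_value):
    """Inverse transform: recover an hour in [0, 24) from (sin, cos)."""
    angle = math.atan2(sin_value, cos_value)   # angle in (-pi, pi]
    return (angle * 24 / (2 * math.pi)) % 24   # wrap negative angles forward

# Round trip for a late-night hour
sin_23, cos_23 = encode_hour(23)
print(decode_hour(sin_23, cos_23))

# Predictions from two separate models need not lie exactly on the unit
# circle; atan2 depends only on direction, so scaling both values by the
# same positive factor does not change the decoded hour.
print(decode_hour(0.5 * sin_23, 0.5 * cos_23))
```

The final `% 24` matters because `atan2` returns negative angles for the second half of the day, which would otherwise decode to negative hours.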

The result is the model below:

This model shows clear improvement at handling times around midnight, although it has not completely eliminated the over- and under-prediction. To evaluate how redefining time improved our model, we compared its error to that of the previous best model, which was based on a linear time representation.

| Model | MAE | RMSE |
|---|---|---|
| Random Forest - Circular | 2.303352 | 6.077466 |
| Random Forest - Full | 2.737512 | 3.560415 |

To further evaluate prediction quality, we visualized residual distributions. In models using linear time, residuals showed clear patterns near the edges of the clock. After applying the circular time model, the residuals were more evenly distributed, indicating a better model fit across the entire 24-hour cycle.

The residuals now follow a sinusoidal pattern centered around zero, rather than a linear distribution. Notably, there are clusters of extreme underpredictions and overpredictions near midnight. These occur because predictions close to midnight must be converted back to a 24-hour scale for visualization, which creates the illusion of large errors: a prediction of 11 PM for an arrest at 12 AM appears as a 23-hour miss. The same wraparound likely explains why the circular model’s RMSE is higher than that of Random Forest - Full even though its MAE is lower, since RMSE penalizes large errors more heavily. However, in a circular representation of time, these values are actually quite close to the true values, so these outliers can largely be ignored.
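One way to see why these outliers are an artifact is to compare the ordinary residual with a circular error that measures the shorter way around the clock. This is a sketch of an alternative error measure, not the metric reported in the tables above, and the function names are hypothetical.

```python
def linear_residual(actual, predicted):
    """Residual on the 0-23 scale, as plotted (actual - predicted)."""
    return actual - predicted

def circular_error(actual, predicted):
    """Absolute error in hours, taking the shorter way around the clock."""
    diff = abs(actual - predicted) % 24
    return min(diff, 24 - diff)

def circular_mae(actual_hours, predicted_hours):
    """Mean circular error over paired actual and predicted hours."""
    errors = [circular_error(a, p) for a, p in zip(actual_hours, predicted_hours)]
    return sum(errors) / len(errors)

# An arrest at 12 AM predicted as 11 PM
print(linear_residual(0, 23))   # large on the linear scale
print(circular_error(0, 23))    # small on the clock
```

Reporting a circular MAE alongside the linear one would show how much of the remaining error is real and how much comes from the wraparound at midnight.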

The resulting model is a relatively accurate prediction of the hour of day an arrest will occur based on the factors of that arrest. This model has practical value for both the Chapel Hill Police Department and the UNC-CH student body. For the police, models like this can inform resource allocation around the town, enabling officers to be strategically positioned during high-risk hours. Arrests and crimes with similar characteristics can be modeled to windows of time so that officers know what to watch out for at different times of day. For students, this information can inform risk-taking behaviors, influence safer daily routines, and increase awareness about hours requiring increased vigilance. To make this model more useful, future work can incorporate additional variables such as campus events, holidays, patrol routes, etc. that may impact arrest patterns. Before this model can be used in the real world, it is essential that it be evaluated for fairness across demographic groups to ensure it doesn’t reinforce harmful biases.

CONCLUSION

We explored two central questions: (1) “Can we predict the hour of day when arrests are most likely to occur in Chapel Hill based on features such as location, time, and demographic factors?” and (2) “Are underage individuals treated differently than adults in terms of arrest type and case disposition?”

For the first question, we developed multiple predictive models, ultimately finding that a Random Forest model using a circular transformation of time (via sine and cosine) achieved the lowest mean absolute error. Although some model limitations remained, particularly around late-night hours, the circular approach improved residual patterns and reduced bias near midnight. By identifying consistent patterns in when and where arrests occur, police departments can allocate officers more effectively around peak hours, such as late nights on weekends near Franklin Street. Our models found that features such as semester status and day of the week meaningfully improved prediction accuracy, suggesting that events tied to the academic calendar influence arrest timing. This opens conversations for university administrators to schedule targeted safety communications, coordinate mental health resources, or determine necessary campus patrol procedures. Rather than reacting to incidents, both law enforcement and university leadership can proactively plan around predictable arrest patterns, with the goal of improving public safety while reducing unnecessary interventions.

In the second part of our project, where we used missing age fields as a proxy for underage status, we found that these individuals were more often cited than detained and were slightly more likely to have their cases formally cleared. To identify minors in the dataset, we relied on the absence of age values, since North Carolina law shields juvenile information. We analyzed two dimensions of differential treatment: case disposition and arrest type. Our results suggest that law enforcement treats minors differently in ways that are likely influenced by both legal requirements and departmental procedures dealing with youth. These findings likely reflect a combination of juvenile justice policy and officer decision-making practices aimed at minimizing harm and legal complexity in youth cases. Working with indirect data poses a challenge, but we uncovered consistent trends in how youth are processed. For policymakers, even though Raise the Age legislation restricts access to juvenile records, patterns in redacted data, such as clusters of missing age fields and citation-heavy arrest types, still offer insight into how minors are policed and processed. This could inform future reforms, such as clearer standards for citations versus detainments. Findings also highlight where youth policing is concentrated, raising questions about whether certain locations are disproportionately represented. For families and educators, understanding how youth are handled by law enforcement can help support community outreach, legal education, and effective policing practices.

Our findings highlight how publicly available arrest records can provide valuable insights into the timing and nature of local law enforcement practices. Future research should incorporate variables such as campus events, patrol routes, arrest severity, recidivism rates, or social event data to enhance the accuracy of the predictive model. Exploring equity across intersections of race, gender, and the socio-spatial context of arrests would also help ensure ethical and responsible use of the data. Additionally, gaining access to more detailed disposition data and complete age information would enable a more nuanced understanding of how policing practices and judicial law affect youth populations. Our dual analysis highlights both the when and how of arrests in Chapel Hill. By combining machine learning and policy analysis, these insights have the potential to inform policy, enhance safety strategies, and foster discussions about equity and privacy in public policing.